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ABSTRACT 

I describe a Bayesian method to account for measurement errors in linear regression of as- 
tronomical data. The method allows for heteroscedastic and possibly correlated measurement 
errors, and intrinsic scatter in the regression relationship. The method is based on deriving a 
likelihood function for the measured data, and I focus on the case when the intrinsic distribution 
of the independent variables can be approximated using a mixture of Gaussians. I generalize 
the method to incorporate multiple independent variables, non-detections, and selection effects 
(e.g., Malmquist bias). A Gibbs sampler is described for simulating random draws from the 
probability distribution of the parameters, given the observed data. I use simulation to com- 
pare the method with other common estimators. The simulations illustrate that the Gaussian 
mixture model outperforms other common estimators and can effectively give constraints on the 
regression parameters, even when the measurement errors dominate the observed scatter, source 
detection fraction is low, or the intrinsic distribution of the independent variables is not a mixture 
of Gaussians. I conclude by using this method to fit the X-ray spectral slope as a function of 
Eddington ratio using a sample of 39 2; < 0.8 radio-quiet quasars. I confirm the correlation seen 
by other authors between the radio-quiet quasar X-ray spectral slope and the Eddington ratio, 
where the X-ray spectral slope softens as the Eddington ratio increases. IDL routines are made 
available for performing the regression. 

Subject headings: methods; data analysis — methods: numerical — methods: statistical 



1. INTRODUCTION 

Linear regression is one of the most common statistical techniques used in astronomical data analysis. 
In general, linear regression in astronomy is characterized by intrinsic scatter about the regression line, 
and measurement errors in both the independent and dependent variables. The source of intrinsic scatter is 
variations in the physical properties of astronomical sources that arc not completely captured by the variables 
included in the regression. It is important to correctly account for both measurement error and intrinsic 
scatter, as both aspects can have a non-negligible effect on the regression results. In particular, ignoring the 
intrinsic scatter and weighting the data points solely by the measurement errors can result in the higher- 
precision measurements being given disproportionate influence on the regression results. Furthermore, when 
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independent variables, measurement error can have an even stronger and more unpredictable effect (|Fox 



19971). In addition, the existence of non-detections , referred to as 'censored data', in the data set will result 



additional complications (e.g., Isobe et al. 1986h . Therefore, when performing regression, it is essential to 
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correctly account for the measurement errors and intrinsic scatter in order to ensure that the data analysis, 
and thus the scientific conclusions based on it, are trustworthy. 

Many methods have been proposed for performing linear regression when intrinsic scatter is present 
and both variables are measured with error. These inc l ude methods th at correct the observed moments 
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statistic (e.g., Clutton-Brock 1967 : Barker fc Diana 1974; Press et al. 19921 : Tremaine et al. 2002 ). assume 
a pr obability dist r ibution for the true in dependent variable values (so-called 'structural equation models', 
e.g., ISchaf( 
oped (e.g., IZellner 
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Roy fc Baneriegi2006): Bayesian approaches to these models have also been deve l 



Gul]||l989HDellaportas fc StephensHl995l : ICarroll et aDll999l : IScheines et allll999l ). 



In addition, methods have been pro posed to account for measurement error in censored regression (e.g., 
IStapleton" fc Young"l984'; 'Weis!j| l993h . The most commonly use d methods i n astronomy are the BCES esti- 
mator (Akritas fc Bershady 199^ and the 'FITEXY' estimator ( Press et aL 1992) . Both method s have their 



advantages and disadvantages, some of which have been pointed out by 
neither method is applicable when the data contain non-detections. 



Tremaine et al 



(2003). 
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In this work I describe a Bayesian method for handling measurement errors in astronomical data anal- 
ysis. My approach starts by computing the likelihood function of the complete data, i.e., the likelihood 
function of both the unobserved true values of the data and the measured values of the data. The measured 
data likelihood is then found by integrating t he likelihood functi on for the complete data over the unob- 
served true values (e.g.. Little fc Rubin 20021 : Gelman et al. 2004). This approach is known as 'structural 
equat ion modelling' of measurement e rror problems, and has b een studied from both a frequentist approach 
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1987l:jcarroll et al.lll99d: [Sch afcr 2001; Aitken fc Roccil I2OO2I) and a Bayesian approach (e.g. 

Miiller fc Roedei] Il997t iRichardson fc Leblondl Il997t iRichardson et al.ll2002[) . In this work, I extend the 



statistical model of ICarroll et al.l (jl999r ) to allow for measurement errors of different magnitudes (i.e., 'het- 
eroscedastic' errors), non-detections, and selection effects, so long as the selection function can be modelled 
mathematically. Our method models the distribution of independent variables as a weighted sum of Gaus- 
sians. The mixture of Gaussians model allows flexibility when estimating the distribution of t he true values 



of th e independent variable, thus increasing its robustness against model mispecification fe.g.. iHuang et al 



20061 ) . The basic idea is that one can use a suitably large enough number of Gaussians to accurately approx- 
imate the true distribution of independent variables, even though in general the individual Gaussians have 
no physical meaning. 

The paper is organized as follows. In §[5] we summarize some notation, and in §[3] I review the effects of 
measurement error on the estimates for the regression slope and correlation cocfhcicnt. In §[4] I describe the 
statistical model and derive the likelihood functions, and in §[5] I describe how to incorporate knowledge of 
the selection effects and account for non-dctcctions. In S 16. Il l describe the prior distribution for this model, 
and in 16.21 1 describe a Gibbs sampler for sampling from the posterior distributions. In §[7] I use simulation 
to illustrate the effectiveness of this structural model and compare with the OLS, BCES(F|X), and FITEXY 
estimators. Finally, in § [8] I illustrate the method using astronomical data by performing a regression of 
the X-ray photon index, Tx, on the Eddington ratio using a sample of 39 z < 0.83 radio-quiet quasars. 
Sections SI \5\ and [6] are somewhat technical, and the reader who is uninterested in the mathematical and 
computational details may skip to them. 
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2. NOTATION 

I will use the common statistical notation that an estimate of a quantity is denoted by placing a 'hat' 
above it; e.g., 9 is an estimate of the true value of the parameter 0. In general, greek letters will denote the 
true value of a quantity, while roman letters will denote the contaminated measured value. I will frequently 
refer to the 'bias' of an estimator. The bias of an estimator is E{9) — 9o, where E{9) is the expectation value 
of the estimator 9, and 9o is the true value of 9. An unbiased estimator is one such that E{9) = 9a. 

I will denote a normal density with mean /i and variance as A(/i, a'^), and I will denote as Np{^, S) 
a multivariate normal density with p-element mean vector /i and p x p covariance matrix S. If I want to 
explicitly identify the argument of the Gaussian function, I will use the notation N{x\^, a'^), which should 
be un derstood to be a Gaussian with mean /i and variance as a function of x. Following iGelman et all 



(|2004l ). I denote the scaled inverse-x^ density as Inv-x^(j/, s^), where v is the degrees of freedom and 
is the scale parameter, and we denote the inverse- Wishart as Inv-Wisharti,(S'), where i/ is the degrees of 
freedom and S is the scale matrix. The inverse- Wishart distribution can be thought of as a multivariate 
generalization of the scaled inverse-x^ distribution. I will often use the common statistical notation where 
"~" means "is drawn from" or "is distributed as". For example, x ~ N{fi,a'^) states that x is drawn from 
a normal density with mean fj, and variance a^. 



EFFECT OF MEASUREMENT ERROR ON CORRELATION AND REGRESSION 

It is well kn own that measurement error can attenuate the estimate of the regression slope and correlation 



coefficient (e.g.. iFullenllQSTt lFoxlll997l ). For completeness, I give a brief review of the effect of measurement 



error on correlation and regression analysis for the case of one independent variable. 

Denote the independent variable as ^ and the dependent variable as rj; ^ and rj are also referred to as 
the 'covariate' and the 'response', respectively. I assume that ^ is a random vector of n data points drawn 
from some probability distribution. The dependent variable, rj, depends on ^ according to the usual additive 
model: 

n,=a + (3C^+e, (1) 

Here, is a random variable representing the intrinsic scatter in rji about the regression relationship, and 
(a, /3) are the regression coefficients. The mean of e is assumed to be zero, and the variance of e is assumed 
to be constant and is denoted as cr^. We do not observe the actual values of {£,,r]), but instead observe values 
(x, y) which are measured with error. The measured values are assumed to be related to the actual values as 

'^i C« ^" ^x^i (2) 

Vi = Vi + ^y,i, (3) 



where ex,i and e^.i are the random measurement errors on Xi and yi, respectively. In general, the errors are 
normally distributed with known variances cr^ ^ and (Jy i, and covariance Uxy,i- For simplicity, throughout 

.2 



the rest of this section I assume that cr^, ct^, and Uxy are the same for each data point 



When the data are measured without error, the least-squares estimate of the regression slope, Pols, 
and the estimated correlation coefficient, p, are 
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Here, Cov{^, if) is the sample covariance between ^ and 77, and Var[(^) is the sample variance of ^. When the 
data are measured with error, the least-squares estimate of the regression slope, boLSy the estimated 
correlation coefRcient, f, become 

t ^ Cov{x,y) ^ Cov{^,T]) +aa:y 

^^■^ Varix) Var{^)+al ^' 



Cov{x,y) _ Cov{£,,T]) 



'xy 



y^Var{x)Var{y) ^{Var{Cj + (jl){Var{T]) + a^) 



(7) 



From these equations it is apparent that the estimated slope and correlation are biased when the data are 
measured with error. 

It is informative to assess the effect of measurement error in terms of the ratios Rx — a^jV ar(x\ Ry = 
ay/Var{y), Rxy = (7xy/Cov{x,y), as these quantities can be calculated from the data. The fractional bias 
in the estimated slope and correlation may then be expressed as 

6 ^ l-Rx 

P l-Rxy ^ ' 



r ^ ^{1-Rx){l-Ry) 

p 1 - Rxy 

From Equations ([8]) and ([9]) it is apparent that measurement errors have the following effects. First, covariate 
measurement error reduces the magnitude of the observed correlation between the independent variable and 
the response, as well as biasing the estimate of the slope towards zero. Second, measurement error in 
the response also reduces the magnitude of the observed correlation between the variables. Third, if the 
measurement errors are correlated the effects depend on the sign of this correlation. If the measurement 
error correlation has the same sign as the intrinsic correlation between ^ and 77, then the measurement errors 
cause a spurious increase in the observed correlation; otherwise the measurement errors cause a spurious 
decrease in the observed correlation. The magnitude of these effects depend on how large the measurement 
errors are compared to the observed variance in x and y. 

In Figure [1] I plot the fractional bias in the correlation coefficient, {p ~ f)/p, as a function of R^ and 
Ry when the errors are uncorrelated. As can be seen, measurement error can have a significant effect 
on the estimation of the linear correlation coefficient. For example, when Rx ~ 0.5 and Ry « 0.5, the 
estimated correlation is « 50% lower than the true correlation. Therefore, interpretation of correlation 
coefficients and regression slopes must be approached with caution when the data have been contaminated 
by measurement error. To ensure accurate results, it is necessary to employ statistical methods that correct 
for the measurement errors. 



4. THE STATISTICAL MODEL 
4.1. Regression with One Independent Variable 



I assume that the independent variable, ^, is drawn from a probability distribution p{(,\'ip), where tp 
denotes the parameters for this distribution. The dependent variable is then drawn from the conditional 
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distribution of rj given ^, denoted as p(?7|^,^); denotes the parameters for this distribution. The joint 
distribution of ^ and i] is then p{(^,ri\ip,9) = In this work I assume the normal hnear 

regression model given by Equation and thus p{'ri\S,,9) is a normal density with mean a + and 
variance ct^, and 9 = (a, /3, fi^). 

Since the data are a randomly observed sample, we can derive the likelihood function for the measured 
data. The likelihood function of the measured dat a, p(x,y\9,il'), is obtained by integrating the complete 
data likelihood over the missing data, ^ and rj (e.g., Little fc Rubin 20021 : Gelman et al. 2004): 



p{x, y\e, i')^ J J Pix, y. V-) dr^. (10) 

Here, p(x, y, ^, ri\9, ip) is the complete data likelihood function. Because of the hierarchical structure inherent 
in the measurement error model, it is helpful to decompose the complete data likelihood into conditional 
probability densities: 

p{x,y\9,ilj) ^ p{x,y\tri)p{'q\^,9)p{^\'ilj) dr]. (11) 



The density p(x,y\£,,r]) describes the joint distribution of the measured values x and y at a given ^ and 
r], and depends on the assumed distribution of the measurement errors, and Cy. In this work I assume 
Gaussian measurement error, and thus p{xi, yi\(,i,rii) is a multivariate normal density with mean {£,i,rii) and 
covariance matrix S^, where Sn^i = cr^ ^, ^22.i = i, and Si2.i = cr-xyA- The statistical model may then be 
conveniently expressed hierarchically as 

6 ^ vim (12) 

77,|f, ~ Nia + pi,,^'') (13) 

y^,Xi\7]i,^,^ ^ N2{[rii,ii],Y.i) (14) 

Note that if Xi is measured without error, then p{xi\^i) is a Dirac delta function, and p(xi, y^j^i, 77^) = 
p{yi\Vi)^i^i ~ Ci)- An equivalent result holds if yi is measured without error. 

Equation pi|) may be used to obtain the observed data likelihood function for any assumed distribution 
of ^. In this work, I model p{£,\'ip) as a mixture of K Gaussians, 

fc=l V^TTTj^ I Z J 

where X^a^i '"'fc ~ 1- Note that, TTfe may be interpreted as the probability of drawing a data point from the fc"^ 
Gaussian. I will use the convenient notation tt = (tti, . . . , ttk), M = (/^i, • ■ ■ , M-ftr)i and = (r^, . . . , t^); note 
that tp — (tt, /i, r^). It is useful to model p(^|'(/') using this form because it is flexible enough to adapt to a wide 
variety of distributions, but is also conjugate for the regression relationship (Eq.[T]) and the measurement 
error distribution, thus simplifying the mathematics. 

Assuming the Gaussian mixture model for pi^lip), the measured data likelihood for the i^^ data point 
can be directly calculated using Equation (lll|l . Denoting the measured data as z = (y, x), the measured data 
likelihood function for the i^^ data point is then a mixture of bivariate normal distributions with weights 
TT, means ( = (Ci, . . . , ), and covariance matrices Vi = (Vi^i, . . . ,VK,i)- Because the data points are 
statistically independent, the full measured data likelihood is then the product of the likelihood functions 
for the individual data points: 

n K , 
p{x,y\9,^) = m2^t^eycpl--{z,~afV,-l{z.-Ck)\ (16) 
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Ck = (a + (3fik,fJ.k) (17) 

Here, denotes the transpose of z. Equation p6)) may be maximized to compu te the maximum-likelihood 
estim ate (MLE). When K > I, the expectation-maximization (E M) algorithm ({ Dem pster. Laird. &: Rubin 
1977 ) is probably the most efficient tool for calculating the MLE. Rov fc Baner icc (200^ describe an EM 



algorithm when p(^) is assumed to be a mixture of normals and the measurement error distribution is 
multivariate t, and their results can be extended to the statistical model described in this work. 

It is informative to decompose the measured data likelihood, p{xi, yi\0, 1^) = p{yi\xi, 9, ip)p{xi\'ip), as this 
representation is useful when the data contain non-detections (cf. , § 15. 2p . The marginal distribution of Xi is 

f \i\ ^fc J 1 {xj - ^ikf \ , . 

p{x^ij)^2^ =exp<-- ^, (19) 

and the conditional distribution of yi given Xi is 

Piy.\x,,e,^) = y , exp(-l [^'-^/":'"-,^)]' l (20) 

^ ' ^ t{V^^Var{y,\x,,k) \ 2 Variy.\x.,k) j ^ ' 

^TkN{x^\^lk,Tl+al^) 

E{yi\x^,k) = a + { 2,2 — + 2 , 9 Mfc (22) 

\ '''fc + '^x,i ) \ '''fc + ^x,i ) 

Var{y,\x,,k) = + a' + - ^-^^^^^ . (23) 

'''fc + '^x,i 

Here, 7^ can be interpreted as the probability that the i*^ data point was drawn from the k^^ Gaussian 
given Xi, E{yi\xi, k) gives the expectation value of yi at Xi, given that the data point was drawn from the 
k^^ Gaussian, and Var{yi\xi, k) gives the variance in yi at Xi, given that the data point was drawn from the 
fc**^ Gaussian. 



4.2. Relationship between Uniformly Distributed Covariates and Effective Estimators 

It is informative to investigate the case where the distribution of ^ is assumed to be uniform, p(^) oc 1. 
Interpreting p{^) as a 'prior' on ^, one may be tempted to consider assuming p(^) oc 1 as a more objective 
alternative to the normal distribution. A uniform distribution for ^ may be obtained as the limit — > 00, 
and thus the likelihood function for p(^) cx 1 can be calculated from Equation ((20)) by taking — ^ 00 and 
K ~ 1. When the measurement errors arc uncorrelated, the likelihood for uniform p(^) is 



Piy\x,9)^ll 

1=1 



1 



27r(cr2 



x.t/ 



: CXp ■ 



1 (y. 



I3x,y 



2a2 



2 

xA 



(24) 



The a rgument of the exponential is the FITEXY goodness of fit statistic, y f w , SuS modified b y 



Tremaine et al 



(2002) to account for intrinsic scatter; this fact has also been recognized by Weiner et al. ( 2006[) . Despit' 
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this connection, minimizing Xexy ^'^^ ^^'^ same as maximizing the conditional likelihood of y given x, as 
both P and cr^ appear in the normalization of the likelihood function as well. 

For a given value of , minimizing Xexy '^^^ interpreted as minimizing a weighted sum of squared 
errors, where the weights are given by the variances in yi at a given Xi, and one assumes a uniform distribution 
for ^. Unfortunately, this is only valid for a fixed value of a^. Moreover, little is known about t he statistical 
properties of the FITEXY estimator, such as its bias and variance, although bootstrapping (e.g., Efronl[l97^ 



Davison fc Hinklevlll997^ may be used to estimate them. Furthermore, it is ambiguous how to calculate the 
FITEXY estimates when there is an intrinsic scatter term. The FITEXY goodness-of-fit statistic, Xexy^ 
cannot be simultaneously minimized with respect to a, and cr^, as Xexy ^ strictly decreasing function 
of (7^. As such, it is unclear how to pro ceed in the optimizatio n beyond an ad hoc approach. Many authors 
have followed the approach adopted bv lTremaine et al.l (j2002r ) and increase cr^ until x'sxY/i''^ ~ 2) = 1, or 
assume cr^ = if Xbxf/C"- — 2) < 1. 

Despite the fact that minimizing x'exy ^^^^ same as maximizing Equation (j24[) . one may still be 
tempted to calculate a MLE based on Equation ([24| . However, it can be shown that if one assumes p(^) oc 1, 
and if all of the x and y have the same respective measurem ent error varia nces, cr^ and , the MLE estimates 
for a and /? are just the ordinary least squares estimates (jZellneii 119711 ). While this is not necessarily true 
when the magnitudes of the measurement errors vary between data points, one might expect that the MLE 
will behave similarly to the OLS estimate. I confirm this fact using simulation in § 17.11 Unfortunately, 
this implies that the MLE for p ( £) oc 1 inherits the bias in the OLS estimate, and thus nothing is gained. 
Furthermore, as arg ued byEuU (|l989l ). one can easily be convinced that assuming p(^) oc 1 is incorrect by 
examining a histogram of x. 



4.3. Regression with Multiple Independent Variables 



The formalism developed in § 14.11 can easily be generalized to multiple independent variables. In this 
case Equation ^ becomes 

Tj, = a + (3^Ci + <^i, (25) 
where P is now a p-element vector and is a p-element vector containing the values of the independent 
variables for the i"^ data point. Similar to before, we assume that the distribution of can be approximated 
using a mixture of K multivariate normal densities with p-element mean vectors /i = (/ii, . . . , /i/f ), p x p 
covariance matrices T — (Ti, . . . ,Tx), and weights tt — (tti, . . . jTTk)- The measured value of is the p- 
element vector x^, and the Gaussian measurement errors on {yi,Xi) have {p+1) x {p+ 1) covariance matrix 
Si. The statistical model is then 

K 

~ J2^kNp{fik,Tk) (26) 

fe=i 

VM^ - iV(a + /3^^„a2) (27) 
yi,y^i\ni,it ^ Np+i{[rj,,^,],Y,i). (28) 

(29) 

Denoting z,; — {yi,:x.i), the measured data likelihood is 

p{x,y\0,i.) = HE (2^)(p+ivVmP/^ 12^'^ - ^^^'^^M - ^^■) j (30) 
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Cfc = {a + [3^ iik,y.k) (31) 

Here, Cfe is the {p + l)-eleinent mean vector of for Gaussian k, Vk,i is the {p + 1) x {p + 1) covariance 
matrix of z, for Gaussian k, ^ is the variance in the measurement error on y^, (Jxy,i is the p-element vector 
of covariances between the measurement errors on and x^, and Y^^^i is the p x p covariance matrix of the 
measurement errors on x^. 

Similar to the case for one independent variable, the measured data likelihood can be decomposed as 
p{x,y\8,ilj) = p{y\x,9,^)p{x\il;), where p(xi|7/i) = Y.k=i T^kNp{yii\^ik,Tk + T.x,i) and 

p{yi\xi,0,ij) ^ > , , , exp<^ -- , . r^> 33 

TrkN{:s.i\^j.k,Tk + Yx,i) 
1'J^(x^|A^J,TJ + Y.x.t) 

E{y,\x„k) = a + /3V + (/3^7^fc+^Jj;,»)(rfc + S.,,)~'(x, -/Xfc) (35) 
Var{y,\x„k) = /^^T,./? + + a^., - (/J^T^ + aj^ ,,)(Tfe + S,,,)"i(Tfc/3 + (7,^,,,). (36) 



5. DATA COLLECTION ISSUES: SELECTION EFFECTS AND NON-DETECTIONS 

There are several issues common in the collection of astronomical data that violate the simple assump- 
tions made in § 2) Astronomical data collection consists almost entirely of passive observations, and thus 
selection effects are a common concern. Instrumental detection limits often result in the placement of upper 
or lower limits on quantities, and astronomical surveys are frequently flux-limited. In this section I modify 
the likelihood functions described in § H] to include the effects of data collection. 

General methods for dealing with missing data are described in iLittle k Rubh] |2002l ) andlGelman et al 



(j2004[ ). and I apply the methodology described in these references to the measurement error model developed 
here. Although in this work I focus on linear regression, many of these results can be applied to more general 
statistical models, such as estimating luminosity functions. 



5.1. Selection Effects 

Suppose that one collects a sample of n sources out of a possible N sources. One is interested in 
understanding how the observable properties of these sources are related, but is concerned about the effects 
of the selection procedure on the data analysis. For example, one may perform a survey that probes some 
area of the sky. There are TV sources located within this solid angle, where N is unknown. Because of the 
survey's selection method, the sample only includes n sources. In this case the astronomer is interested in 
how measurement error and the survey's selection method affect statistical inference. 

I investigate selection effects within the framework of our statistical model by introducing an indicator 
variable, /, which denotes whether a source is included in the sample. If the i**^ source is included in the 
sample, then /j = 1, otherwise = 0. In addition, I assume that the selection function only depends on the 
measured values, x and y. Under this assumption, the selection function of the sample is the probability of 
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including a source with a given x and y, y). This is commonly the case in astronomy, where sources 

are collected based on their measured properties. For example, one may select sources for a sample based 
on their measured properties as reported in the literature. In addition, if one performs a flux-limited survey 
then a source will only be considered detected if its measured flux falls above some set flux limit. If a sample 
is from a survey with a simple flux limit, then p{Ii = l\yi) = 1 if the measured source flux yi is above the flux 
limit, and p{Ii = l|yi) = if the measured source flux is below the flux limit. Since the selection function 
depends on the measured flux value, and not the true flux value, sources with true flux values above the flux 
limit can be missed by the survey, and sources with true flux below the limit can be detect ed by the survey. 



This effect is well-known in astronomy and is commonly referred to as Malmquist bias (e.g.. lLandv fc Szalav 



19921) 



Including the variable /, the complete data likelihood can be written as 

p{x, y, T], I\d, ip) = p{I\x, y)p{x, y\^, v)p{v\(., 0)p{^\i')- (37) 

Equation p7p is valid for any number of independent variables, and thus Xi and may be either scalar or 
vector. Integrating Equation (|37p over the missing data, the observed data likelihood is 

p{Xobs,yobs\0,-ip, N) cc ( ^ j Y\. P(^i^yi\^''^) 

X n / P^^:> = 0|a;j,?;j)p(a;j,% |Cj,7?j)p(?7ilO>%fe#) dxj dy, d^^ dr]j. (38) 
f N \ . . . 

Here, is the binomial coefhcient, Aobs denotes the set of n included sources, Xobs and yobs denote the 

V " / ^ 

values of x and y for the included sources, and Amis denotes the set oi N — n missing sources. In addition, I 
have omitted terms that do not depend on 0, ^, or N . Note that N is unknown and is thus also a parameter 
of the statistical model. The binomial coefhcient is necessary because it gives the number of possible ways 
to select a sample of n sources from a set of N sources. 

It is apparent from Equation ([55)1 that statistical in ference on the regression parameters is u naffected 



if the selection function is independent of y and x. (e.g.. iLittle fc Rubinll2002t iGelman et al.ll2004l ). In this 
case the selection function may be ignored. 



5.1.1. Selection Based on Measured Independent Variables 

It is commonly the case that a sample is selected based only on the measured independent variables. 
For example, suppose one performs a survey in which all sources with measured optical flux greater than 
some threshold are included. Then, these optically selected sources are used to fit a regression in order to 
understand how the X-ray luminosity of these objects depends on their optical luminosity and redshift. In 
this case, the probability of including a source only depends on the measured values of the optical luminosity 
and redshift, and is thus independent of the X-ray luminosity. 

When the sample selection function is independent of y, given x, then p{I\x,y) = p{I\x). Because we 
are primarily interested in the regression parameters, 9, I model the distributions of ^ for the included and 
missing sources seperately, with the parameters for the distribution of included sources denoted as V'obs- In 
addition, I assume that the measurement errors between y and x are statistically independent. Then the 
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N — n integrals over y and -q for the missing sources in Equation (|38|) are equal to unity, and we can write 
the observed data likelihood as 

p{,Xobs,yobs\0,i}obs) °^Y\.J J Pi^i\i^)Piyi\Vi)p{Vi\£.i,S)p{^i\Ii = IjV'ofcs) d^i drj^, (39) 

i—l 

where p{^i\Ii — l,iJjobs) is the distribution of those ^ included in one's sample. Here I have omitted terms 
depending on N because one is primarily interested in inference on the regression parameters, 9. Equation 
([39|) is identical to Equation (fTTj) . with the exception that p{£,\ip) now only models the distribution of those 
^ that have been included in one's sample, and I have now assumed that the measurement errors on y and a: 
are independent. In particular, for the Gaussian mixture models described in i; 14.11 and i; 14. 3[ the observed 
data likelihood is given by Equations and (|5D|) . where tt, fi, and (or T) should be understood as 
referring to the parameters for the distribution of the observed ^. As is evident from the similarity between 
Equations and pip , if the sample is selected based on the measured independent variables, and if the 
measurement errors on the dependent and independent variables are statistically independent, then inference 
on the regression parameters, 9, is unaffected by selection effects. 



5.1.2. Selection Based on Measured Dependent and Independent Variables 

If the method in which a sample is selected depends on the measured dependent variable, y, or if the 
measurement error in x and y are correlated, the observed data likelihood becomes more complicated. As 
an example, one might encounter this situation if one uses an X-ray selected sample to investigate the 
dependence of X-ray luminosity on optical luminosity and redshift. In this case, the selection function of 
the sample depends on both the X-ray luminosity and redshift, and is thus no longer independent of the 
dependent variable. Such data sets are said to be 'truncated'. 

If the selection function depends on y, or if the measurement errors on y and x are not independent, 
one cannot simply ignore the terms depending on N , since the N — n integrals in Equation (jSSp depend on 
9. However, we can eliminate the dependence of Equation (|38p on the unknown N by applying a Bayesian 
approach. The posterior distribution of 9,ijj, and N is related to the observed data likelihood function as 
p{9,ip,N\xobs,yobs) o: p{9,-)p,N)p{xobs,yobs\9,4',N), where p{9,'ip,N) is the prior distribution of {9,ip,N). 



If we assume a uniform prior on 6',-0, and logiV, then one can show (e.g., iGelman et al.ll2004[) that the 
posterior distribution of 9 and V' is 

n 

p{9, ^\xobs,yobs) « [p{I = 1\9, tA)]-" Y[p{x,,y^\9, i;). (40) 

Here, p{xi,yi\9,ip) is given by Equation (|lip . and p(I = 1|0, V') is the probability of including a source in 
one's sample, given the model parameters, 9 and ip: 

p{I = 1|0, V) = / / Pil = Ik, y)pix, y\9, iP) dx dy. (41) 



I have left off the subscripts for the data points in Equation (j4ip because the integrals are the same for each 
(xj, yj, ^j, rjj). If one assumes the Gaussian mixture model of Sections 14. II and l4. 31 then p{xi, yi\9, "0) is given 
by Equations or ([50)1 . The posterior mode can then be used as an estimate of 9 and tp, which is found 
by maximizing Equation (j40p . 



- 11 - 



5.2. Non-detections 



In addition to issues related to the sample selection method, it is common in astronomical data to have 
non-detections. Such non-detections are referred to as 'censored' data, and the standard procedure is to 
place an upper and/or lower limit on the censored data point. Methods of data analysis for censored data 
have been reviewed and proposed in the as trono mical literature, (e .g., iFeigelson fc NelsonI Il985l : ISchmitt 



1985 



Marshal]|[l993 lAkritas fc Siebertlll996[). andllsobe et al.l |l986 ') describe censored regression when the 



variables are measured without error. See 



FeigelsonI (|l992f ) for a review of censored data in astronomy. 



To facilitate the inclusion of censored data, I introduce an additional indicator variable, Z), indicating 
whether a data point is censored or not on the dependent variable. If yi is detected, then Di = 1, else if 
yi is censored then Di = 0. It is commonly the case that a source is considered 'detected' if its measured 
flux falls above some multiple of the background noise level, say 3a. Then, in this case, the probability of 
detecting the source given the measured source flux jji is p{Di ~ l|2/i) = 1 if J/i > 3cr, and p{Di — 0\yi) = 1 if 
yi < 3(7. Since source detection depends on the measured flux, some sources with intrinsic flux rj above the 
flux limit will have a measured flux y that falls below the flux limit. Similarly, some sources with intrinsic 
flux below the flux limit will have a measured flux above the flux limit. 

I assume that a sample is selected based on the independent variables, i.e., p{I\x,y) = p{I\x). It is 
difficult to imagine obtaining a censored sample if the sample is selected based on its dependent variable, as 
some of the values of y are censored and thus unknown. Therefore, I only investigate the effects of censoring 
on y when the probability that a source is included in the sample is independent of y, given x. In addition, I 
do not address the issue of censoring on the independent variable. Although such methods can be developed, 
it is probably simpler to just omit such data as inference on the regression parameters is unaffected when a 
sample is selected based only on the independent variables (cf. , § I5.1.1[) . 



The observed data likelihood for an x-selected sample is given by Equation ([39|) . We can modify this 
likelihood to account for censored y by including the indicator variable D and again integrating over the 
missing data: 



p{Xobs,yobs,D\0,'lJjobs) cc p{xi,yt p{Xj\llJobs) / p{D 



(42) 

Here, the first product is over the set of data points with detections, Adet, and the second product is over the 
set of data points with non-detections, Acens- The conditional distribution p{yj\xj, 6, ipobs) and the marginal 
distribution p{xj\ipobs) for the Gaussian mixture model are both given in § 14.11 and S 14.31 If the data points 
are measured without error and one assumes the normal regression model, = A^(??|a - |- P^,a^), then 

Equation [42l becomes the censored data likelihood function described in llsobe et al.l (|l986l ). A MLE for 
censored regression with measurement errors is then obtained by maximizing Equation ()42p . 



6. COMPUTATIONAL METHODS 

In this section I describe a Bayesian method for computing estimates of the regression parameters, 9, 
and their uncertainties. The Bayesian approach calculates the posterior probability distribution of the model 
parameters, given the observed data, and therefore is accurate for both small and large sample sizes. The 
posterior distribution follows from Baye's formula as p{9,^p\x,y) oc p{9,Tp)p(x,y\9,ip), where p(9,'ip) is the 
prior distribution of the parameters. I describe some Markov Chain methods for drawing random variables 
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from the posterior, wh ich can then be used to estimate quantities such as standard errors and con f idence 
intervals on 9 and il). Gelman et al. ( 2004 ) is a good reference on Bayesian methods, and iLoredo ( 19921 ) 
gives a review of Bayesian methods intended for astronomers. Further detail s of Markov Chain si mulation, 
including methods for making the simulations more efficient, can be found in lGelman et al.l (2004). 



6.1. The Prior Density 

In order to ensure a proper posterior for t he Gaussian mixture model , it is necessary to invoke a 



proper prior density on the mixture parameters (jRoeder fc WassermanI 1 199711 . I adopt a uniform prior on 
the regression parameters (a, /3, ct^), and take tti, . . . , ttk ~ Dirichlet(l, . . . , 1). The Dirichlet density is a 
multivariate extension of the Beta density, and the Dirichlet (1, . . . , 1) prior adopted in this work is equivalent 
to a uniform prior on tt, under the constraint X]/c=i = 1- 



The pr ior on ii and r (or T ) adopted in this work is very similar to that advocated bv lRoeder &: Wasserman 



(|l997l) and ICarroll et all (Il999[l . I adopt a normal prior on the individual /Zfc with mean /zq and variance 



V? (or covariance matrix U). This reflects our prior belief that the distribution of ^ is more likely to be 
fairly unimodal, and thus that we expect it to be more likely that the individual Gaussians will be close 
together than far apart. If there is only one covariate, then I adopt a scaled inverse-x^ prior on the individual 
r| with scale parameter vo^ and one degree of freedom, otherwise if there are p > 1 covariates I adopt an 
inverse- Wishart prior on the individual Tfe with scale matrix W and p degrees of freedom. This reflects our 
prior expectation that the variances for the individual Gaussian components should be similar, but the low 
number of degrees of freedom accomodates a large range of scales. Both the Gaussian means and variances 
are assumed to be independent in their prior distribution, and the 'hyper-parameters' Ho,u^ (or U), and w'^ 
(or W) are left unspecified. By leaving the parameters for the prior distribution unspecified, they becomes 
additional parameters in the statistical model, and therefore are able to adapt to the data. 

Since the hyper-parameters are left as free parameters they also require a prior density. I assume a 
uniform prior on /ip and w'^ (or W). If there is one covariate, then I assume a scaled inverse-x^ prior for 
with scale parameter and one degree of freedom, otherwise if there are multiple covariate we assume a 
inverse- Wishart prior for U with scale matrix W and p degrees of freedom. The prior on (or U) reflects 
the prior expectation that the dispersion of the Gaussian components about their mean /xq should be on 
the order of the typical dispersion of each individual Gaussian. The prior density for one covariate is then 
p{9,^, IJ.Qju'^ ,w^) (X p{tt)p{ij,\ijo,u'^)p{t'^\w'^)p{u'^\w'^) and is summarized hierarchically as 



a, (3 r 


Uniform(— cxD, oo) 


(43) 




^ Uniform(0, oo) 


(44) 


TT 


- Dirichlet(l,...,l) 


(45) 




- 7V(Aio,u') 


(46) 


2 2 2| 2 


Inv-x^(l,w^) 


(47) 




Uniform(— oo, oo) 


(48) 




Uniform(0, oo). 


(49) 



The prior density for multiple covariates is just the multivariate extension of Equations (j43| - ((49|) . 
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6.2. Markov Chains for Sampling from the Posterior Distribution 

The posterior distribution summarizes our knowledge about the parameters in the statistical model, 
given the observed data and the priors. Direct computation of the posterior distribution is too computation- 
ally intensive for the model described in this work. However, we can obtain any number of random draws 
from the posterior using Markov chain monte carlo (MCMC) methods. In MCMC methods, we simulate 
a Markov chain that performs a random walk through the parameter space, saving the locations of the 
walk at each iteration. Eventually, the Markov chain converges to the posterior distribution, and the saved 
parameter values can be treated as a random draw from the posterior. The random draws can then be used 
to estimate posterior medians, standard errors, of plot histogram estimates of the posterior. 



6.2.1. Gihhs Sampler for the Gaussian Model 

The easiest method for sampling from the posterior is to construct a Gibbs sampler. The basic idea 
behind the Gibbs sampler is to construct a Markov Chain, where new values of the model parameters and 
missing data are simulated at each iteration, conditional on the values of the observed data and the current 
values of the model parameters and the missing data. Within the context of the measurement error model 
considered in this work, the Gibbs Sampler undergoes four different stages. 

The first stage of the Gibbs sampler simulates values of the missing data, given the measured data and 
current parameter values, a process known as data augmentation. In this work the missing data are 77,^, 
and any non-detections. In addition, I introduce an additional latent variable, Gi, which gives the class 
membership for the z'^ data point. The vector has K elements, where Gik = 1 if the i*'' data point comes 
from the fc*** Gaussian, and Gy = if j ^ fc. I will use G to refer to the set of n vectors G^. Noting that tt^ 
gives the probability of drawing a data point from the fc'^ Gaussian, the mixture model for ^ may then be 
expressed hierarchically as 

Gi|7r ~ Multinom(l,7ri, . . . ,7ri^) (50) 

i^\G,k^l,^Xk.Tl ^ NiiXk.Tl). (51) 

Here, Multinom(m,pi, . . . ^px) is a multinomial distribution with m trials, where pk is the probability of 
success for the k^^ class on any particular trial. The vector G^ is also considered to be missing data, and is 
introduced to simplify construction of the Gibbs sampler. 

The new values of the missing data simulated in the data augmentation step are then used to simulate 
new values of the regression and Gaussian mixture parameters. The second stage of the Gibbs sampler 
simulates values of the regression parameters, 6, given the current values of xi and 77. The third stage 
simulates values of the mixture parameters, '0, given the current values of ^ and 77. The fourth stage uses 
the new values of 9 and ij^ to update the parameters of the prior density. The values of the parameters are 
saved, and the process is repeated, creating a Markov Chain. After a large number of iterations, the Markov 
Chain converges, and the saved values of 9 and ^ from the latter part of the algorithm may then be treated 
as a random draw from the posterior distribution, p{9, '4'\x, y). 

Methods for simul ating random variables from the distributions used f or this Gibbs sampler are described 



in various works (e.g.. iRiplevlllQSTt iPress et al.lll992l : ICelman et al.ll2004l ). 



A Gibbs sampler for the Gaussian mixture model is 



- 14 - 



1. Start with initial guesses for r/, G, 0, tJj, and the prior parameters. 



2. If there are any non-detections, then draw yi for the censored data points from p{yi\rii, Di = 0) oc 
p{Di — 0\yi)p{yi\'r]i) ■ This may be done by first drawing yi from p{yi\i]i): 

y,h-N{7j,,al,). (52) 

One then draws a random variable Ui, uniformly-distributed on [0, 1]. If Ui < p{Di — Q\yi) then the 
value of yi is kept, otherwise one draws a new value of yi and Ui until Ui < p{Di ~ Q\yi). 

3. Draw values of ^ from p{^\x,y,r],G,9,ip)- The distribution p{^\x,y,r], 0,9,1/:) can be derived from 
Equations p^ - p^ or and the properties of the multivariate normal distribution: 



(a) If there is only one independent variable then ^i is updated as: 
^i\xi,yi,r]i,Gi,9,il; ~ N{£_i,a} ) 



a] 



K 

E 



K 



£,xy,i 



'-[m - Vi) 



fe=i 



1 0^ 



(53) 
(54) 

(55) 
(56) 

(57) 
(58) 



Here, Pxy,i = <^xy,i/ {(^x,i(^y,i) IS the correlation between the measurement errors on Xi and yi. 
Note that ^i is updated using only information from the fc'^ Gaussian, since Gij — 1 only for 
j = k and Gij = otherwise, 
(b) If there are multiple independent variables, I have found it easier and computationally faster to 
update the values of using a scalar Gibbs sampler. In this case, the p elements of are updated 
individually. I denote ^ij to be the value of the j*^ independent variable for the i^^ data point, 
and Xij to be the measured value of ^y . In addition, I denote Ci.-j to be the {p — l)-element 
vector obtained by removing from ^i, i.e., = (^a, ■ • • ■ ■ ■ ,^ip)- Similarly, 

/3_j denotes the {p — l)-element vector of regression coefficients obtained after removing (3j from 
(3. Then, ^.y is updated as 



6j|xj,?;j,Gj,^j_j,77,,6i,V' 



K 

E 

fc=i 



(59) 

Gikiijk (60) 

),+! + (r-V*fe), + - « - /3^,6,-,)/a^ 



(S-i)(,+i)(,+i) + (T,7^)„-+/3|/^2 

yi -rjiii 1 = 1 
x.a if / = j + 1 
Xii - j + 1 



(61) 
(62) 
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{fik)i if 1= j 



K 



k=l 



2 



1) + (Tk% + i 



(63) 
(64) 

(65) 



Here, z* is a (p + l)-element vector obtained by subtracting (?7i,^i) from = {yi,Xi), with the 
exception of the j*'' element of ^i; instead, the (j + 1)*'' element of z* is just Xij. The p-element 
vector fi*f. is obtained in an equivalent manner. The (p+ 1) x {p+ 1) matrix is the covariance 
matrix of the measurement errors on z^. The term (E~^z*)(j+i) denotes the (j + 1)'^ element of 
the vector E^^z*, and hkewise for {T,r^fJ'*k)j- The terms (E~^)q_|.i)q_|_i) and {Tjr^)jj denote the 
(j + 1)"^ and j*^ elements of the diagonals of E^^ and 7)7^, respectively. This step is repeated 
until all p independent variables have been updated for each data point. 



If any of the are measured without error, then one simply sets = Xi for those data points. 

4. Draw values of rj from p{ri\x,y,^,9). Similar to ^, the distribution p{r]\x,y,£,,9) can be derived from 
Equations (fT2 |) -(fT4 | or ((26|) ~([28 |) and the properties of the multivariate normal distribution. 



(a) If there is only one covariate then rj is updated as 



2 



(b) If there are multiple covariates then 77 is updated as 

(Eriz*)i + (a + /3^e.)^' 



-I 



(s. )ii + ^ 



1/0-2 

-1 



(66) 
(67) 

(68) 

(69) 
(70) 

(71) 
(72) 



Here, (E^ ^z*)i is the first element of the vector E^ ^z*, z* is a (p+ l)-element vector whose first 
element is yi and remaining elements are — ^i, and (E~^)ii is the first diagonal element of E~^. 

If any of the 77 are measured without error, then one sets rj — y for those data points. 

5. Draw new values of the Gaussian labels, G. The conditional distribution of Gj is Multinomial with 
number of trials m = 1 and group probabilities = p(Gik ~ l|i^i;, "0): 



Multinom(l, qi, . . . , q^) 
TrkNp{^i\fj,k, Tk) 



(73) 
(74) 
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Note that if there is only one covariate then p = 1 and Tfc = r^. 

6. Draw (a, (3) froinp(a, (3\^, t], a^). Given ^, r], and cr^, the distribution of a and /? is obtained by ordinary 
regression: 

a,/3|^,7?,c7^ ~ iVp+i(c,Se) (75) 
c = (X^X)-iX^ry (76) 
Ee = (X^X)-V2. (77) 

Here, X is a n x (p+ 1) matrix, where the first column is a column of ones, the second column contains 
the n values of for the first independent variable, the third column contains the n values of for 
the second independent variable, etc. 

7. Draw a new value of from p((t^|^, f], a, (3). The distribution p(cr^|^, ry, a, /3) is derived by noting that 

given a, (3 and rji is normally distributed with mean a + 0^ and variance cr^. Re-expressing this 
distribution in terms of instead of rj, and taking the product of the distributions for each data point, 
it follows that cr^ has a scaled inverse-x^ distribution: 

c72|e,r,,a,/3 ~ Inv-x'(i.,s2) (78) 
V = n-2 (79) 



^ n-2 

1=1 



1 " 

— ^(r?,-a-/3^e.)'- (80) 



8. Draw new values of the group proportions, n. Given G, w follows a Dirichlet distribution: 

7r|G ~ Dirichlet(ni + 1, . . . , + 1) (81) 

n 

rik = $^G,fe. (82) 

i=l 

Note that rife is the number of data points that belong to the fc**^ Gaussian. 

9. Draw a new value of fik from p{fik\^,G,Tk, fJ,o,U). If there is only one independent variable, then 
Tfe = and U = u^. The new value of /Zfe is simulated as 

Mfe|€,G,Tfe,Mo,t/ - iVp(Afc,S^J (83) 

Afc = (C/-i+nfeT-^)-i([/-Vo + nfeT-ia) (84) 
1 " 

= —T^Gik^i (85) 

S^, = ([/-i+nfeT-i)-!. (86) 

10. Draw a new value of r| or Tfe. The distribution of t^|^,/x or Tfe|^,/K is derived in a manner similar to 
(T^l^, ?7,Q!,/3, and noting that the prior is conjugate for this likelihood. The distribution of t^|^, M is a 
scaled inverse-x^ distribution, and the distribution of Tfel^, /i is an inverse- Wishart distribution: 

(a) If there is only one independent variable then draw 

4\^,G,Hk,w'' ~ Iny-x^{uk,tl) (87) 

i/fe = rife + l (88) 

1 



+ Gjfc($i - /Xfe)^ 



(89) 



(b) If there are multiple independent variables then draw 

Tk%G,iik,W ~ Inv-Wishart,,(5fc) (90) 
Vk = rik+p (91) 

n 

Sk = W + J2G^k{^^- ^ikm- ^lkf■ (92) 
i=l 

11. Draw a new value for /xoIM) U. Noting that conditional on /xq and U, yUi, . . . ,ij,k are independently 
distributed as Np{ij,o, U), it is straight-forward to show that 



Mo|M,t^ ~ Npii2,U/K) (93) 

K 

i 



= ^E^^- (94) 



2 



If there is only one covariate then p ^ 1 and U = u 

12. Draw a new value for or [/, given /xo,/i, and w'^ (or M^). Similar to the case for or T^, the 
conditional distribution of v? or J/ is scaled inverse-x^ or inverse-Wishart. 

(a) If there is only one covariate then 

u^|/io,/",w^ ~ Inv-x^(^'u,M^) (95) 
Uu = K+1 (96) 

K 



k=l 



(97) 



(b) If there are multiple covariates then 



U\no,l^,W ~ Inv-Wishart^^(C)') (98) 
i^u = K + p (99) 



K 



U = W + ^{iik-l^o){l^k-l^of. (100) 



k=l 

13. Finally, draw a new value of w'^\u'^, or W\U, T: 

(a) If there is only one covariate then w'^\u'^,t'^ is drawn from a Gamma distribution. This can be 
derived by noting that p(i(;^|'U^,T^) oc p(u^\w^)p{t^\w^) has the form of a Gamma distribution as 
a function of w"^. The new value of is then simulated as 

w'^\u^,t'^ ~ Gamma(a, 6) (101) 
a = l-iK + 3) (102) 
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(b) If there are multiple covariates then W\U,T is drawn from a Wishart distribution. This can be 
derived by noting that p{W\U,T) oc p{U\W)p{T\W) has the form of a Wishart distribution as a 
function of W. The new value of W is then simulated as 

W\U,T - Wishart^„(iy) (104) 
i^w = {K + 2)p+l (105) 

W = {U-'+Y.^-')-'- (106) 
fe=i 

After completing steps[^HT51above. an iteration of the Gibbs sampler is complete. One then uses the new 
simulated values of 6, ip, and the prior parameters, and repeats steps [2 HT31 The algorithm is repeated 
until convergence, and the values of 9 and ijj at each iteration are saved. Upon reaching convergence, one 
discards the values of 9 and Tp from the beginning of the simulation, and the remaining values of a, /3, cr^, /i, 
and (or T) may be treated as a random draw from the posterior distribution, p{9,^l)\x,y). One can then 
use these values to calculate estimates of the parameters, and their corresponding variances and confidence 
intervals. The posterior distribution of the parameters can also be estimated from these values of 9 and ij; 
using histogram tech niques. Techniques for monitoring convergence of the Markov Chains can be found in 



Gelman et all (|2004f ) 



The output from the Gibbs sampler may be used to perform Bayesian inference on other quantities of 
interest. In particular, the Pearson linear correlation coefficient, p, is often used in assessing the strength 
of a relationship between the x and y. A random draw from the posterior distribution for the correlation 
between rj and ^j, denoted as pj, can be calculated from Equation ([5|) for each draw from the Gibbs sampler. 
For the Gaussian mixture model, the variance Var{ri) and covariance matrix = Var{^) are 

Var{ri) = (3^1:^(3 + (107) 

K 



= Y^TTkiTk+phpl)-^^ (108) 

fc=i 

K 
k=l 

and Var{^j) is the j'^ diagonal element of E^. The simplification for one covariate is self-evident. 

If there is considerable posterior probability near « or t| « 0, then the Gibbs sampler can get 
'stuck'. For example, if t| « 0, then step[3a]of the Gibbs sampler will draw values of w pk- Then, step 
|9] will produce a new value of pk that is almost identical to the previous iteration, step II Gal will produce a 
new value of « 0, and so on. The Gibbs sampler will eventually get 'unstuck', but this can take a long 
time and result in very slow convergence. In particular, it is very easy for the Gibbs sampler to get stuck 
if the measurement errors are large relative to or r|, or if the number of data points is small. In this 
situation I have found it useful to use the Metropolis-Hastings algorithm instead. 



6.2.2. Metropolis- Hastings Algorithm 



If the selection function is not independent of y, given the independent variables (cf. Eq.[lD]), or if 
the selection function depends on x and the measurement errors are correlated, then posterior simulation 
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based on the Gibbs sampler is more complicated. In addition, if the measurement errors are large compared 
to the intrinsic dispersion in the data, or if the sample size is small, then the Gibbs sampler can become 
stuck and extremely inefficient. In both of these cases one c an use the Metropolis-Hastings algorithm 
( Metropolis fc Ulam 1949t Metropolis et al. 19531 Hastings 1970 ) to sample from the posterior distribution, 



as the Metropolis-Hasting algorithm can avoid construct ing markov chains for £ an d ri. For a description o f 
the Metropolis-Hastings algorithm, we refer the reader to lChib fc Greenberg (jl995r ) or lGelman et al.l (j2004r ). 



7. SIMULATIONS 

In this section I perform simulations to illustrate the effectiveness of the Gaussian structural model for 
estimating the regression parameters, even in the presence of severe measurement error and censoring. In 
addition, I compare the OLS, BCES(F|X), and FITEXY estimators with a maximum-likelihood estimator 
based on the Gaussian mixture model with K = 1 Gaussian. 



7.1. Simulation Without Non-Detections 

The first simulation I performed is for a simple regression with one independent variable. I generated 
2.7 X 10^ data sets by first drawing n values of the independent variable, ^, from a distribution of the form 

p(0 «e« (l + e2-75e-)-i_ (^^0) 

The distribution of ^ is shown in Figure [21 along with the best-fitting one and two Gaussian approximations. 
In this case the two Gaussian mixture is nearly indistinguishable from the actual distribution of ^, and 
thus should provide an excellent approximation to p{^). The values for ^ had a mean of /i = —0.493 and a 
dispersion of r = 1.200. I varied the number of data points in the simulated data sets as n = 25, 50, and 
100. I then simulated values of 77 according to Equation with a = 1.0 and /3 = 0.5. The intrinsic scatter, 
e, had a normal distribution with mean zero and standard deviation a = 0.75, and the correlation between 
r] and ^ was p w 0.62. The joint distribution of ^ and rj for one simulated data set with n = 50 is shown in 
Figure [31 

Measured values for ^ and 77 were simulated according to Equations ([2]) and ([3]). The measurement 
errors had a zero mean normal distribution of varying dispersion and were independent for x and y. The 
variances in the measurement errors, cr^ ^ and cr^j, were different for each data point and drawn from a 
scaled inverse-x^ distribution. The degrees of freedom for the inverse-x^ distribution was v = 5, and the 
scale parameters are denoted as t and s for the x and y measurement error variances, respectively. The 
scale parameters dictate the typical size of the measurements errors, and were varied as t = 0.5r, r, 2r and 
s — 0.5a, tr, 2a. These values corresponded to values of Rx ~ 0.2, 0.5, 0.8 and Ry ^ 0.15, 0.4, 0.6 respectively. 
I simulated 10"* data sets for each grid point of t, s, and n, giving a total of 2.7 x 10^ simulated data sets. 
The joint distributions of x and y for varying values oit/r and s/a are also shown in Figure [3l These values 
of X and y are the 'measured' values of the simulated data set shown in the plot of ry as a function of ^. 

For each simulated data set, I calculated the maximum-likelihood estimate, found by maximizing Equa- 
tion (Uni). For simplicity, I only use K = 1 Gaussian. I also calculated the OLS, BCES(r|X), and FITEXY 
estimates for comparison. I calculated a OLS estimate of a^ by subtracting the average cr^ from the variance 



in the regression residuals. If the OLS estimate of ct^ was negative, I set aoLS = 0- Following iFulleij (|l987l ). 



I estimate a^ for a BCES(y |X)-type estimator as a^^'ES ~ ^O'^iu) ~ '^y ~ $BCEsCov{x,y), where ay is the 
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average measurement error variance in and /9rcf„s is t h e BC ES(y|X) estimate of the slope. If cr%cES 
is negative, I set (Jbces — 0. Following Tremaine et al. ( 20021 ). I compute a FITEXY estimate of a by 



increasing until x\xy li^ ~ 2) = 1, or assume cr^ = if X^exy li''^ — 2) < 1. The sampling distributions 
of the slope and intrinsic scatter estimators for n = 50 are shown in Figures 2] and [5] as a function of f/r, 
and the results of the simulations are summarized in Table [TJ 

The bias of the OLS estimate is apparent, becoming more severe as the measurement errors in the inde- 
pendent variable increase. In addition, the variance in the OLS slope estimate decreases as the measurement 
errors in ^ increase, giving one the false impression that one's estimate of the slope is more precise when the 
measurement errors are large. This has the effect of concentrating the OLS estimate of j3 around j3oLS ^ 0, 
thus effectively erasing any evidence of a relationship between the two variables. When the measurement 
errors are large, the OLS estimate of the intrinsic scatter, o'Qj^g, is occasionally zero. 

The BCES(F|X) estimator performs better than the OLS and FITEXY estimators, being approximately 
unbiased when the measurement errors are ax/r < 1. However, the BCES estimate of the slope, /3bces = 
Cov{x,y)/ {Var{x) — ct^), suffers some bias when the measurement errors are large and/or the sample size 
is small. In addition, the variance in Pbces is larger than the MLE, and Pbces becomes considerably 
unstable when the measurement errors on ^ are large. This instability results because the denominator in 
the equation for /3bces is Var{x) — ct^. If ct^ ~ Var{x), then the denominator is ~ 0, and /3bces can 
become very large. Similar to the OLS and FITEXY estimates, the estimate of the intrinsic variance for the 
BCES-type estimator is often zero when the measurement errors are large, suggesting the false conclusion 
that there is no intrinsic scatter about the regression line. 

The FITEXY estimator performed poorly in the simulations, being both biased and highly variable. 
The bias of the FITEXY estimator is such that $exy tends to overestimate /3, the severity of which tends 



to increase as Ry decreases. This upward bias in (Sexy has been noted by IWeiner et al.l (j2006| ). who 
also performed simulations comparing Pexy with Pbces- They note that when one minimizes Xexy 
alternatively with respect to (3 and a^, and iterates until convergence, then the bias in I3exy can be improved. 
I have tested this and also find that the bias in (3exy is reduced, but at the cost of a considerable increase 
in variance in (3exy- In general, our simulations imply that the variance of the FITEXY estimator is 
comparable to that of the BCES(y|X) estimator if one does not iterate the minimization of XexY' ''^^'^ 
the variance of Pexy is larger if one does iterate. However, since 13bces is approximately unbiased when 
Rx is not too large, $bc'es should be preferred over (3exy- In addition, when the measurement errors are 
large the FITEXY estimate of a is commonly (Jexy — 0, similar to the BCES-type estimate of the intrinsic 
dispersion. 

The maximum-likelihood estimator based on the Gaussian structural model performs better than the 
OLS, BCES, and FITEXY estimators, and gives fairly consistent estimates even in the presence of severe 
measurement error and low sample size. The MLE is approximately unbiased, in spite of the fact that the 
MLE incorrectly assumes that the independent variables are normally distributed. The variance in the MLE 
of the slope, (3mle, is smaller than that of Pbces and (3exy, particularly when R^ is large. In contrast 
to the OLS estimate of the slope, the dispersion in Pmle increases as the measurement errors increases, 
reflecting the additional uncertainty in Pmle caused by the measurement errors. Finally, in contrast to 
the other estimators, the MLE of the intrinsic variance is always positive, and the probability of obtaining 
^MLE = is negligible for these simulations. 



I argued in § 14.11 that assuming a uniform distribution on ^ does not lead to better estimates than 
the usual OLS case. I also used these simulations to estimate the sampling density of the MLE assuming 
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p{^) oc 1. The results were nearly indistinguishable from the OLS estimator, supporting our conjecture that 
assuming p{^) oc 1 does not offer an improvement over OLS. 

While it is informative to compare the samphng distribution of our proposed maximum-hkehhood es- 
timator with those of the OLS, BCES(y|X), and FITEXY estimators, I do not derive the uncertainties in 
the regression parameters from the sampling distribution of the MLE. As described in § 16.21 we derive the 
uncertainties in the regression parameters by simulating draws from the posterior distribution, p{6,ip\x,y). 
This allows a straight-forward method of interpreting the parameter uncertainties that does not rely on large- 
sample approximations, as the posterior distribution is the probability distribution of the parameters, given 
the observed data. The posterior distributions of p, and a for a simulated data set with n = 50, ~ t, 
and ay ^ a is shown in Figure [6l When estimating these posteriors, I used K = 2 Gaussians in the mixture 
model. As can be seen from Figure [6l the true values of p, /3, and a are contained within the regions of 
non-negligible posterior probability. 1 have estimated posteriors for other simulated data sets, varying the 
number of data points and the degree of measurement error. As one would expect, the uncertainties in the 
regression parameters, represented by the widths of the posterior distributions, increase as the size of the 
measurement errors increase and the sample size decreases. 

A common frequentist approach is to compute the covariance matrix of the MLE by inverting the 
estimated Fisher information matrix, evaluated at the MLE. Then, under certain regularity conditions, the 
MLE of the parameters is asympotically normally distributed with mean equal to the true value of the 
parameters and covariance matrix equal to the inverse of the Fisher information matrix. Furthermore, under 
these regularity conditions the posterior distribution and sampling distribution of the MLE are asymptotically 
the same. Figure [7] compares the posterior distribution of the slope for a simulated data set with that inferred 
from the MLE. The posterior and MLE was calculated assuming K — 1 Gaussian. As can be seen, the 
posterior distribution for P is considerably different from the approximation based on the MLE of f3, and 
thus the two have not converged for this sample. In particular, the posterior is more skewed and heavy-tailed, 
placing more probability on values of /3 > than does the distribution approximated by the MLE. Therefore, 
uncertainties in the MLE should be interpreted with caution if using the asymptotic approximation to the 
sampling distribution of the MLE. 

7.2. Simulation With Non-Detections 

To assess the effectiveness of the Gaussian structural model in dealing with censored data sets with 
measurement error, I introduced non-detections into the simulations. The simulations were performed in an 
identical manner as that described in §[731 but now 1 only consider sources to be 'detected' ii y > 1.5. For 
those sources that were 'censored' (y < 1.5), I placed an upper limit on them of y = 1.5. 

I focus on the results for a simulated data set with n ~ 100 data points and measurement errors similar 
to the intrinsic dispersion in the data, ay ^ a and a^ t. The detection threshold of y > 1.5 resulted in a 
detection fraction of ^ 30%. This simulation represents a rather extreme case of large measurement errors 
and low detection fraction, and provides an interesting test of the method. In Figure[8]I show the distribution 
of ^ and ?7, as well as the distribution of their measured values, for one of the simulated data sets. For this 
particular data set, there were 29 detections and 71 non-detections. As can be seen, the significant censoring 
and large measurement errors have effectively erased any visual evidence for a relationship between the two 
variables. 

I estimated the posterior distribution of the regression parameters for this data set using the Gibbs sam- 
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pier (cf, I6.2.ip with K = 2 Gaussians. The posterior median of the regression Hne, as well as the 95% {2a) 
pointwise confidence intervals^ on the regression line are shown in Figure [H] The posterior distributions for 
p, f3, and cr are shown in Figure O As can be seen, the true value of the parameters is contained within the 
95% probability regions, although the uncertainty is large. For this particular data set, we can put limits 
on the value of the correlation coefficient as 0.2 < p < 1 and the slope as < P < 2.0. For comp arison, the 



usual maximum- likelihood estimate that ignores the measurement error fe.g.. Ilsobe et al.lll986l ) concludes 



P = 0.229 ± 0.077. This estimate is biased and differs from the true value of /3 at a level of 3.5(7. 

The posterior constraints on the regression parameters are broad, reflecting our considerable uncertainty 
in the slope, but they are sufficient for finding a positive correlation between the two variables, f and rj. 
Therefore, despite the high level of censoring and measurement error in this data set, we would still be able 
to conclude that rj increases as f increases. 



8. APPLICATION TO REAL ASTRONOMICAL DATA: DEPENDENCE OF ON 

Lboi/LEdd FOR RADIO-QUIET QUASARS 

To further illustrate the effectiveness of the method, 1 apply it to a data set drawn from my work 
on investigating the X-ray properties of radio-quiet quasars (RQQs). Recent work has suggested a corre- 
lation between quasar X-ray spectra l slope , ax, fv cx: and quasar Eddington ratio, Lboi/LEdd (e.g., 

Porquet et al. 2004; Piconcelli et al. 2005t Shemmer et al. 2006 ). In this section I apply the regression 
method to a sample of 39 z < 0.83 RQQs and confirm the T x-Lboi/ Lsdd correlation. Because the purpose 
of this section is to illustrate the use of this regression method on real astronomical data, I defer a more 
in-depth analysis to a future paper. 

Estimation of the Eddington luminosity, LEdd cx Mbh, requires an estimate of the black hole mass, 
Mbh- Black hole virial masses may be estimated as Mbh oc where R is the broad line region size, 

and V is the velocity dispersion of the gas emitting the broad emission lines. A correlation has been 
foun d between the lu minosity of a source and the size of it's broad line region (the R-L relationship, 
e.g.. iKaspi et al.l 120051 ). One can then exploit this relationship, and use the broad li ne FWHM as an es- 
timate for i;, o btaining virial mass estimates M^g oc L^v^ fe.g.. IWandel et al.lll999r ). where the exponent 
\s 9 ~ 0.5 (e.g.. IVestergaard &: Peterson! 120061 ). Unfortunately, the uncertainty on t he broad line estimates 
of Mf-H can be considerable, having a standar d deviation of ~ 0.4 dex (e.g., McLure &: JarvisI 20021 : 
Vestergaard fc Petersonl l2006t iKellv et al.l 120071 ). For ease of comparison with previous work, I estimate 
Mbh using only the H/3 emission line. The logarithm of the virial mass estimates were calculated using the 
H/3 luminosity and FWHM according to the relationship given by Vestergaard fc Peterson ( 2006h . 

My sample consists of a subset of the sample of Kellv fc Bechtoldl (2007). These sources have measure- 
ments of the X-ray photon index, Tx — olx + 1, obtained from Chandra observations, and measurements 
of the optical/UV luminosity at 2500A, denoted as ^2500, obtained from SDSS spectra. The H/3 profile was 
m odeled as a sum of Gaussians and extracted from the SDSS spectra according to the procedure described 



Kellv et al.l (j2007l ). I estimated the H/3 FWHM and luminosity from the line profile fits. 



I estimate the bolometric luminosity, Lboi, from the luminosity at 2500A, assuming a constant bolometric 



^Technically, these are called 'credibility intervals', as I am employing a Bayesian approach. These intervals contain 95% of 
the posterior probability. While the difference between confidence intervals and credibility intervals is not purely semantical, I 
do not find the difference to be significant within the context of my work, so I use the more familiar term 'confidence interval'. 
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cor rection Lhni = 5.6L 2Rnn (jElvis et al.lll994f) . The standard deviation in this bolometric correction reported 
by lElvis et ahl (|l994[ ) is 3.1, implying an uncertainty in logLboi of atoi ^ 0.25 dex. Combining this with 
the ~ 0.4 dex uncertainty on log Mbh, the total 'measurement error' on XogLboi/ LEdd becomes ~ 0.47 
dex. The distribution of Tx as a function of log L\yoi / L Edd is shown in Figure 1101 As can be seen, the 
measurement errors on both Tx and log Lboi/L Edd are large and make a considerable contribution to the 
observed scatter in both variables, where Ry ~ 0.1 and Rx ~ 0.8. Therefore, we expect the measurement 
errors to have a significant effect on the correlation and regression analysis. 

I performed the regression assuming the linear form Tx — a + f3 log Lhoi/ L Edd, and modelleling the 
intrinsic distribution of log Ljyoi / L^dd using K = 2 Gaussians. Draws from the posterior were obtained 
using the Gibbs sampler. The marginal posterior distributions for /3, a, and the correlation between Tx and 
log Lboi/Ledd, p, are shown in Figure [TTl and the posterior median and 95% (2a) pointwise intervals on the 
regression line are shown in Figure [TOl The posterior median estimate of the parameters are d = 3.12 ± 0.41 
for the constant, (3 — 1.35 ± 0.54 for the slope, a = 0.26 ± 0.11 for the intrinsic scatter about the regression 
line, fi^ = —0.77 ± 0.10 for the mean of logLboi/LEdd, and ct^ — 0.32 ± 0.12 dex for the dispersion in 
log Lhoi/Ledd- Here, I have used a robust estimate of the posterior standard deviation as an 'error bar' on 
the parameters. These results imply that the observed scatter in log Lboi/LEdd is dominated by measurement 
error, <7x/t ^ 1-5, as expected from the large value of R^- 

For comparison, the BCES(y|X) estimate of the slope is $bces — 3.29 ±3.34, the FITEXY estimate is 
$EXY = 1.76±0.49, and the OLS estimate is Pols = 0.56±0.14; the standard error on $exy was estimated 
using bootstrapping. Figure [10] also compares the OLS, BCES, and FITEXY best-fit lines with the posterior 
median estimate. The 95% confidence region on the slope implied by the posterior draws is 0.46 < (3 < 3.44, 
whereas the approximate 95% confidence region implied by the BCES, FITEXY, and OLS standard errors 
are -3.26 < f3 < 9.84, 0.80 < (3 < 2.72, and 0.42 < /3 < 0.70, respectively. The OLS and FITEXY estimates 
and the Bayesian approach give 'statistically significant' evidence for a correlation between log Lhoi/L Edd 
and Tx', however the BCES estimate is too variable to rule out the null hypothesis of no correlation. As 
noted before, the large measurement errors on log Lboi/L Edd bias the OLS estimate of /? toward shallower 
values and the FITEXY estimate of f3 toward steeper values. Because of this bias, confidence regions based 
on (3oLS and (3exy are not valid because they are not centered on the true value of (3, and thus do not 
contain the true value with the stated probability (e.g., 95%). On the other hand, confidence regions based 
on the BCES estimate are likely to be approximately valid; however, in this example the large measurement 
errors have caused (3bces to be too variable to give meaningful constraints on the regression slope. 

The BCES-type estimate of the intrinsic dispersion was &bces = 0.32 and the OLS estimate of the 
intrinsic dispersion was ctols = 0.41, where both were calculated in the same manner as in § 17.11 The 
FITEXY estimate of the intrinsic dispersion was ^exy = 0, as X^xy/l"- ^ 2) < 1. The BCES-type estimate 
of a is similar to the Bayesian posterior median estimate, while uq ls overestimates the scatter compared to 
the Bayesian estimate by « 58%. In contrast, the FITEXY estimator does not find any evidence for intrinsic 
scatter in the regression, which is inconsistent with the posterior distribution of a. 

From the posterior distribution, we can constrain the correlation between T x and log Li,oi / L Edd to be 
0.328 < (0 < 0.998 with k, 95% probability, confirming the positive correlation between Tx and Eddington 
ratio seen previously. The posterior median estimate of the correlation is p = 0.87, compared with an 
estimate of f = 0.54 if one naively calculates the correlation directly from the measured data. The large 
measurement errors significantly attenuate the observed correlation, making the observed correlation between 
Fx and log Lboi/Ledd appear weaker than if one does not correct for the measurement errors. 
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9. CONCLUSIONS 

In this work I have derived a likehhood function for handUng measurement errors in hnear regression 
of astronomical data. Our probabihty model assumes that the measurement errors are Gaussian with zero 
mean and known variance, that the intrinsic scatter in the dependent variable about the regression line is 
Gaussian, and that the intrinsic distribution of the independent variables can be well approximated as a 
mixture of Gaussians. I extend this model to enable the inclusion of non-detections, and describe how to 
incorporate the data selection process. A Gibbs sampler is described to enable simulating random draws 
from the posterior distribution. 

I illustrated the effectiveness of structural Gaussian mixture model using simulation. For the specific 
simulations performed, a maximum-likelihood estimator based on the Gaussian structural model performed 
better than the OLS, BCFiS{Y\X), and FITEXY estimators, especially when the measurement errors were 
large. In addition, our method also performed well when the measurement errors were large and the detection 
fraction was small, with the posterior distributions giving reasonable bounds on the regression parameters. 
These results were in spite of the fact that the intrinsic distribution of the independent variable was not 
a sum of Gaussians for the simulations, suggesting that approximating the distribution of the independent 
variable as a mixture of Gaussians does not lead to a significant bias in the results. Finally, I concluded 
by using the method to fit the radio-quiet quasar X-ray photon index as a function of log L^oi / L Edd, using 
a sample of 39 z < 0.83 sources. The posterior distribution for this data set constrained the slope to be 
< /3 < 3.5 and the linear correlation coefficient to be 0.2 < p < 1.0, confirming the correlation between 
X-ray spectral slope and Eddington ratio seen by other authors. 

Although I have focused on linear regression in this work, the approach that I have taken is quite 
general and can be applied to other applications. In particular. Equations ifTTj) . (|40l) . and (|42)) are derived 
under general conditions and are not limited to regression. In this work, I assume forms for the respective 
probability densities that are appropriate for linear regression; however, these equation provide a framework 
for constructing more general probability models of one's data, as in, for example, nonlinear fitting (e.g., ) 
or estimation of luminosity functions. 

IDL routines for constructingA/farkov Chains for sampling from the posterior are publicly available from 
the IDL astronomy user's library □ or directly from B. Kelly. 

This work was supported in part by NSF grant AST-0307384. The author would like to thank the referee 
for comments that contributed to the improvement of this paper, and for providing some of the references 
to the statistics literature. The author would also like to thank Jill Bechtold, Eric Feigelson, and Aneta 
Siemiginowska for looking over and offering helpful comments on an early version of this paper. 
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Fig. 1. — The fractional bias in the correlation coefficient when the data are contaminated with measurement 
error. The fractional bias is shown as a function of the contribution of measurement error to the observed 
variance in both x and y, for uncorrelated measurement errors. When the measurement errors make up 
^ 50% of the observed variance in both x and y, the observed correlation coefficient is reduced by about 
~ 50%. 
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Fig. 2. The actual distribution of ^ (solid line) for the simulations, compared with the best-fitting one 
(dashed line) and two (dashed-dotted line) Gaussian fit. The two Gaussian fit is nearly indistinguishable 
from the true p(^). Althought the one Gaussian fit provides a reasonable approximation to the distribution 
of ^, it is not able to pick up the asymmetry in p(^). 
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Fig. 3. — Distributions of the simulated data for various levels of measurement error (cf., § 17. ip . The top 
left panel shows the distribution of ?7 as a function of ^ for one simulated data set; the sohd hne is the true 
value of the regression line. The remaining panels show the distributions of the observed values, y and x, for 
various levels of measurement error. The data point with error bars in each panel is a fictitious data point 
and is used to illustrate the median values of the error bars. The box outlines the bounds of the plot of rj 
against ^. As can be seen, large measurement errors wash out any visual evidence for a correlation between 
the variables. 
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Fig. 4. — The sampling distributions of the slope estimators as a function of covariate measurement error 
magnitude for n = 50 data points and Oy a, inferred from simulations (cf., 17. ip . The estimators are 
the ordinary least-squares estimator (OLS), the BCES(y|X) estimator, the FITEXY estimator, and the 
maximum-likelihood estimator (MLE) of the K — 1 gaussian structural model. The solid vertical lines 
mark the true value of (3 — 0.5, and the dashed vertical lines mark the median values of each respective 
estimator. The OLS estimator is biased toward zero, while the FITEXY estimator is biased away from zero; 
in both cases, the bias gets worse for larger measurement errors. The BCES(F|X) estimator is, in general, 
unbiased, but can become biased and highly variable if the measurement errors becomes large. The MLE 
of the Gaussian model performs better than the other estimators, as it is approximately unbiased and less 
variable. 
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Fig. 5. — Same as Figured! but for the standard deviation of the intrinsic scatter, cr. The solid vertical lines 
mark the true value of cr = 0.75, and the dashed vertical lines mark the median values of each respective 
estimator. All of the estimators exhibit some bias, and the BCES and FITEXY estimators can exhibit 
significant variance. Moreover, the BCES and FITEXY estimators both commonly have values of ct = 0, 
misleading one into concluding that there is no intrinsic scatter; this occasionally occurs for the OLS estimate 
as well. In contrast, the MLE based on the Gaussian model does not suffer from this problem, at least for 
these simulations. 
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Fig. 6. — The marginal posterior distributions of the linear correlation coefficient, the regression slope, and 
the intrinsic dispersion for a simulated data set of n = 50 data points with fja, ~ r and cjy ~ a. The vertical 
lines mark the true values of the parameters. The true values of the regression parameters are contained 
within the spread of the marginal posteriors, implying that bounds on the regression parameters inferred 
from the posterior are trustworthy. 
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Fig. 7. — The posterior distributions of the slope (sohd histogram), compared with the posterior approxi- 
mated from the MLE and Fisher information matrix (dashed hne), for a simulated data set of n = 50 data 
points with (3 = 0.5, (Ta; ^ t, and ctj, ~ cr. The two distributions have not converged and the bayesian and 
frequentist inference differ in this case, with the bayesian approach placing more probability near (3 « 0.5 
and on positive value of (3. 



- 35 - 




Fig. 8. — Distribution of ry and ^ (left), and the measured values of y and x (right), from a simulated censored 
data set of n = 50 data points, ax ~ r, and ay ^ a (cf., § I7.2p . In the plot of ry and ^, the solid squares 
denote the values of ^ and 77 for the detected data points, and the hollow squares denote the values of ^ and 
rj for the undetected data points. The solid line in both plots is the true regression line. In the plot of y and 
X, the squares denote the measured values of x and y for the detected data points, and the arrows denote 
the 'upper limits' on y for the undetected data points. The fictitious data point with error bars illustrates 
the median values of the error bars. The dashed-dotted line shows the best fit regression line, as calculated 
from the posterior median of a and /3, and the filled region defines the approximate 95% {la) pointwise 
confidence intervals on the regression line. The true values of the regression line are contained within the 
95% confidence intervals. 
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Fig. 9. — Same as Figure [H but for the censored data set shown in Figure [H The true vahies of the regres- 
sion parameters are contained within the spread of the posteriors, implying that bounds on the regression 
parameters inferred from the posterior are trustworthy. 
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Fig. 10. — The X-ray photon index, Tx, as a function of log Lboi/L Edd for 39 z < 0.8 radio-quiet quasars. 

In both plots the thick solid line shows the posterior median estimate (PME) of the regression line. In the 
left plot, the filled region denotes the 95% {2a) pointwise confidence intervals on the regression line. In the 
right plot, the thin solid line shows the OLS estimate, the dashed line shows the FITEXY estimate, and the 
dot-dashed line shows the BCES(F|X) estimate; the error bars have been omitted for clarity. A significant 
positive trend is implied by the data. 
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Fig. 11. — Same as Figure [6l but for the V x-^og Lboi / L Edd regression. Although the uncertainty on the 
slope and correlation are large, the bounds on them imphed by the data are < /? < 3.5 and 0.2 < p < 1.0. 



Table 1. Dependence of the Estimator Sampling Distributions on Measurement Error and Sample Size 





OLS 




BCBS{Y\X) 


FITEXY 




MLE 


t/r = s/cr» 


0" 


a'' 


$ a 


$ a 


13 


(T 



0.5 25 0.35712:116 0.784tg-7,J^ O.^l^tHf, Omit^.H 0.896ti.lll 0.855tJ:?f, 0.513tliH 0.677tS:rs^o 

50 0.355+_°:\ll 0.801+_°ill O.SlOiS.fsl 0.716l°:f,J 0.898+_°oill 0.873tJ:«- O.SOelS'.t^ 0.717t°:fo? 

100 0.354ig:iil 0.810t°;f,? 0.50612^61 0-743t°:re 0.895iS:^,f, O.^SSt^ZTr 0.504tS;lS 0.732t°;f/, 

1.0 25 0.190«-,f, 0.798tl,Tol 0.442t--? 0.610l-f,« 0.827i-.f,? 0.7271--? 0.524i«-J?^ 0.572l°-l 

50 0.19lt»„:lS 0.839lS;?S 0.519iJ;?i? 0.643tJ:«f3 0.870ljif3 0.814^^;^^, 0.519i«;=f„ 0.669tl-ltl 

100 189+°-^^^ 862+''-^^'' 520+"'^^^ 687+°-^*** 895+"-'"^^ 855 + ^-^'"' 502+''-^^'' 714+° ''^^ 

2.0 25 0.066t»itl 0.565i;:- 0.0361-^- 0.6631-.-;^ 0.4431-.™?, O.OOOl-?- 0.366+_\:tll 0.3811;;-^ 

50 0.067l2:;ra 0.768^1111 0.imtl%l\ 0.743+^?- 0.634;33-^? 0.258l-«f, 0.426;j:°=| 0.559ti:«g 

100 OmntlXll 0.843lJ:lf3 0.209l-.|f, 0.743lJ:?f3 0.765l-.«l 0.627+_r^ 0.444i°:fJ 0.673lS:f.l 



Note. — The values given for /S, and a are the median and interval containing 90% of the estimates over the simulations. For example, 
when t/r = s/cr = 0.5 and n = 25, the median value of the OLS slope estimator is 0.357, and 90% of the values of 0ols are contained within 

U.OO(_^ 246- 

^Typical value of the measurement error magnitude for the simulations. 
^The number of data points in the simulated data sets. 
^The estimate of the slope, {3. The true value is /3 = 0.5. 

'^The estimate of the dispersion in the intrinsic scatter, cr. The true value is a = 0.75. 



